arXiv:1501.05269vl [physics.soc-ph] 21 Jan 2015 


Uncovering the spatial structure of mobility networks 


Thomas Louail^’^, Maxime Lenormand^, Miguel Picornell^, Oliva Garcia Cantii^, 
Ricardo Herranz"*^, Enrique Frias-Martinez®, Jose J. Ramasco^, Marc Barthelemy^’®* 

^ Institut de Physique Theorique, CEA-CNRS (URA 2306), F-91191, Gif-sur-Yvette, France 
^ Geographic-Cites, CNRS-Paris 1-Paris 7 (UMR 8504), 13 rue du four, FR-75006 Paris, France 
^IFISC, Institute de Fisica Interdisciplinar y Sistemas Complejos (CSIC-UIB), 

Campus Universitat de les Hies Balears, F-07122 Palma de Mallorca, Spain 
‘^Nommon Solutions and Technologies, calle Cahas 8, F-28043 Madrid, Spain 
® Telefonica Research, E-28050 Madrid, Spain and 
^Centre dAnalyse et de Mathematique Sociales, EHFSS-CNRS (UMR 8557), 

190-198 avenue de France, FR-75013 Paris, France 

The extraction of a clear and simple footprint of the structure of large, weighted and directed 
networks is a general problem that has many applications. An important example is given by origin- 
destination matrices which contain the complete information on commuting flows, but are difficult to 
analyze and compare. We propose here a versatile method which extracts a coarse-grained signature 
of mobility networks, under the form of a 2 x 2 matrix that separates the flows into four categories. 
We apply this method to origin-destination matrices extracted from mobile phone data recorded 
in thirty-one Spanish cities. We show that these cities essentially differ by their proportion of two 
types of flows; integrated (between residential and employment hotspots) and random flows, whose 
importance increases with city size. Finally the method allows to determine categories of networks, 
and in the mobility case to classify cities according to their commuting structure. 


The increasing availability of pervasive data in various 
fields has opened exciting possibilities of renewed quanti¬ 
tative approaches to many phenomena. This is particu¬ 
larly true for cities and urban systems for which different 
devices at different scales produce a very large amount 
of data potentially useful to construct a ‘new science of 
cities’ [T]. 

A new problem we have to solve is then to extract 
useful information from these huge datasets. In partic¬ 
ular, we are interested in extracting coarse-grained in¬ 
formation and stylized facts that encode the essence of 
a phenomenon, and that any reasonable model should 
reproduce. Such meso-scale information helps us to un¬ 
derstand the system, to compare different systems, and 
also to propose models. This issue is particularly strik¬ 
ing in the study of commuting in urban systems. In 
transportation research and urban planning, individuals 
daily mobility is usually captured in Origin-Destination 
(OD) matrices which contain the flows of individuals go¬ 
ing from a point to another (see [UI3]). An OD matrix 
thus encapsulates the complete information about indi¬ 
viduals flows in a city, at a given spatial scale and for 
a specific purpose. It is a large network, and as such 
does not provide a clear, synthetic and useful information 
about the structure of the mobility in the city. More gen¬ 
erally, it is very difficult to extract high-level, synthetic 
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information from large networks and methods such as 
community detection [i] and stochastic block modeling 
(see for example [5] and laiz]) were recently proposed. 
Both these methods group nodes in clusters according to 
certain criteria and nodes in a given cluster have similar 
properties (for example, in the stochastic block model¬ 
ing, nodes in a given group have similar neighborhood). 
These methods are very interesting when one wants to 
extract meso-scale information from a network, but are 
unable to construct expressive categories of links and to 
propose a classification of weighted (directed) networks. 
This is particularly true in the case of commuting net¬ 
works in cities, where edges represent flows of individuals 
that travel daily from their residential neighborhood to 
their main activity area. Several types of links can be dis¬ 
tinguished in these mobility networks, some constitute 
the backbone of the city by connecting major residen¬ 
tial neighborhoods to employment centers, while other 
flows converge from smaller residential areas to impor¬ 
tant employment centers, or diverge from major residen¬ 
tial neighborhoods to smaller activity areas. In addition, 
the spatial properties of these commuting flows are fun¬ 
damental in cities and a relevant method should be able 
to take this aspect into account. 

There is an important literature in quantitative geog¬ 
raphy and transportation research that focuses on the 
morphological comparison of cities [IHII] and notably on 
multiple aspects of polycentrism, ranging from schematic 
pictures proposed by urban planners and architects [T^ 
to quantitative case studies and contextualized compar¬ 
isons of cities [TJUIH] . So far most comparisons of large 
sets of cities have been based on morphological indica¬ 
tors [HIIS]) — built-up areas, residential density, number 
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of sub-centers, etc. — and aggregated mobility indica¬ 
tors |10[lllj — motorization rate, average number of trips 
per day, energy consumption per capita per transport 
mode, etc. —, and have focused on the spatial organi¬ 
zation of residences and employment centers. But these 
previous studies did not propose generic methods to take 
into account the spatial structure of commuting trips, 
which consist of both an origin and a destination. Such 
comparisons based on aggregated indicators thus fail to 
give an idea of the morphology of the city in terms of 
daily commuting flows. We still need some generic meth¬ 
ods that are expressive in a urban context, and that could 
constitute the quantitative equivalent of the schematic 
pictures of city forms that have been pictured for long 
by urban planners m- 

In this paper we propose a simple and versatile method 
designed to compare the structure of large, weighted and 
directed networks. In the next section we describe this 
method in detail. The guiding idea is that a simple and 
clear picture can be provided by considering the distri¬ 
bution of flows between different types of nodes. We 
then apply the method to commuting (journey to work) 
OD matrices of thirty-one cities extracted from a large 
mobile phone dataset. We discuss the urban spatial pat¬ 
terns that our method reveals, and we compare these 
patterns observed in empirical data to those obtained 
with a reasonable null model that generates random com¬ 
muting networks. Finally the method allows determin¬ 
ing categories of networks with respect to their structure, 
and here to classify cities according to their commuting 
structure. This classification highlights a clear relation 
between commuting structure and city size. 


Results 

Extracting coarse-grained information from OD 
matrices 

For the sake of clarity, we will use here the language of 
OD matrices, but the method could easily be applied to 
any weighted and directed network from which we want 
to extract high-level information. 

We assume that for a given city, we have the n x n 
matrix Fij where n is the number of spatial units that 
compose the city at the spatial aggregation level consid¬ 
ered (for example a grid composed of square cells of size 
a, see Methods). This OD matrix Fij represents the num¬ 
ber of individuals living in the location i and commuting 
to the location j where they have their main, regular ac¬ 
tivity (work or school for most people). By convention, 
when computing the numbers of inhabitants and workers 
in each cell we do not consider the diagonal of the OD 
matrix. This means that we omit the individuals who 
live and work in the same cell (considered as ‘immobile’ 
at this spatial scale). 

In order to extract a simple signature of the OD ma¬ 
trix, we proceed in two steps. We first extract both the 


residential and the work locations with a large density 
— the so called ‘hotspots’ (see [16]). The number of res¬ 
idents of cell i is given by its number of 

workers is given by The hotspots then corre¬ 

spond to local maxima of these quantities. It is important 
to note that the method is general, and does not depend 
on how we determine these hotspots. 

Once we have determined the cells that are the resi¬ 
dential and the work hotspots (some cells can possibly 
be both), we proceed to the second and main step of the 
method. We reorder the rows and columns of the OD 
matrix in order to separate hotspots from non-hotspots. 
We put the m residential hotspots on the top lines, and 
do the same for columns by putting the p work hotspots 
on the left columns. The OD matrix then becomes a 
4-quadrants matrix where the flows Fij are spatially po¬ 
sitioned in the matrix with respect to their nature: on 
the top left the individuals that live in hotspots and work 
in hotspots; at the top right the individuals that live in 
hotspots and do not work in hotspots; at the bottom 
left individuals that do not live in hotspots but work in 
hotspots; and finally in the bottom right corner the in¬ 
dividuals that neither live or work in hotspots. For each 
quadrant we sum the number of commuters and normal¬ 
ize it by the total number of commuters in the OD ma¬ 
trix, which gives the proportion of individuals in each of 
the four categories of flows. In other words, for a given 
city, we reduce the OD matrix to a 2 x 2 matrix 

^ = (c r) (1) 

where 

1 = E / E 

is the proportion of Integrated flows that go from resi¬ 
dential hotspots to work hotspots; 

c'= E / E 

is the proportion of Convergent flows that go from ran¬ 
dom residential places to work hotspots; 

^ — E E 

is the proportion of Divergent flows that go from resi¬ 
dential hotspots to random activity places; 

^ ~ E/ '^b V E/ ^b 

iGm+l..n,jGp+l..n 

is the proportion of Random flows that occur ‘at random’ 
in the city, i.e. that are going from and to places that 
are not hotspots. 
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FIG. 1: Illustration of the ICDR method. The 

method decomposes the commuting flows in the city in 
four categories: the Integrated flows (!) from hotspot to 
hotspot, the Convergent flows (C) to hotspots, the 
divergent flows (D) originating at hotspots and finally 
the random flows (i?), which are neither starting nor 
ending at hotspots. For each city with its 
origin-destination matrix, we can compute the 
importance of each commuting flow category and get a 
simple picture of the mobility structure in the city. 

By construction, we have I, C,D,R G [0; 1] and I + 
C + V + R — 1. This matrix A is thus a very simple foot¬ 
print of the OD matrix that gives an expressive picture 
of the structure of commuting in the city, as illustrated 
by Fig. [l] 

Comparison of the mobility networks of thirty-one 
cities 

Commuting data 

Large scale individual mobility networks are nowadays 
extracted from pervasive geolocated data, such as mo¬ 
bile phone, GPS, public transport cards or social apps 
data [IMI]- In particular, if an individual’s mobile 
phone geolocated activity is available during a sufliciently 
long period of time, it is possible — under certain reg¬ 
ularity conditions — to infer the most likely locations 
of her home and her workplace, and by aggregation to 
construct OD matrices [22H^ . Several parameters how¬ 
ever impact the construction of OD matrices such as the 
nature of the data source (survey or user-generated geolo¬ 
cated data), or the spatial scale at which the OD matrix 


is built which can be dictated by administrative units 
(divisions in wards, counties, municipalities, etc.) or by 
technical reasons such as the density of antennas in the 
case of mobile phone data. Given this variety of data col¬ 
lection protocols, it is thus particularly remarkable that 
when considering the commuting flows at a city scale, 
various sources of pervasive data provide a very similar 
mobility information when compared to the OD matrices 
built from surveys |24) . This result needs to be confirmed 
for other cities and countries, but it already opens the 
door to a systematic use of pervasive, geolocated data as 
a relevant substitute to traditional transport surveys. 

In the following we apply our ICDR method to OD 
matrices that have been extracted from mobile phone 
records in thirty-one Spanish urban areas during a five 
weeks period (see the Methods section for details on the 
dataset and the calculation of the OD matrices). 

Hotspots 

As described above, the first step consists in deter¬ 
mining hotspots. Several possible methods have been 
proposed in the literature uniiaiii], and we use here 
a parameter-free method based on the Lorenz curve of 
the densities that we have proposed in a recent study 
(see m and the Methods section). Once we have de¬ 
termined both the origin (residential) hotspots and the 
destination (work) hotspots in each city, we first observe 
how their number scale with the population size of the 
city. Both these numbers for residential and employment 
hotspots scale sublinearly with the population size (see 
Supplementary Figures 4 and 5). The number of work 
hotspots grows significantly slower than the number of 
residential hotspots, showing that residential areas are (i) 
more dispersed in the city, and (ii) are more numerous 
than activity centers, as intuitively expected (see Sup¬ 
plementary Figure 6 that displays the locations of Home 
and Work hotspots in four cities that exhibit different 
spatial organizations). We also note here that the sublin- 
ear scaling of the work hotspots confirms previous results 
obtained with a totally different dataset (the number of 
employment centers in US cities) P7] . 

ICDR values 

We now apply the second part of the method in order 
to calculate the I,C,D and R values for each OD matrix. 
For the 31 Spanish urban areas under study (see Supple¬ 
mentary Figure 1), we obtain the values shown in Fig. 

In Fig. da), we plot these values versus the population 
size of these cities. For this sample of cities we see that 
globally the proportion I of individuals that commute 
from hotspot to hotspot decreases as the population size 
increases, while the proportion R of ‘random’ flows in¬ 
creases and the proportions C and D of convergent and 
divergent flows seem surprisingly constant whatever the 
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city size. In Fig. ^b) we plot the same values but sorted 
by decreasing values of / which shows clearly that the / 
and R values are the relevant parameters for distinguish¬ 
ing cities from each other. 

We also notice that the values obtained for another 
spatial scale of data aggregation confirm this trend (see 
Supplementary Figure 7). The decay of / flows (’inte¬ 
grated’) flows in favor of R flows (’random’) when P in¬ 
creases shows that the population growth among Spanish 
cities goes with a decentralization of both activity places 
and residences. As cities get bigger, their numbers of 
residential and employment hotspots grow (sublinearly), 
but these hotspots catch a smaller part of the commuting 
flows. 


A null model 

In order to evaluate to what extent the ICDR sig¬ 
natures of cities are characteristic of their commuting 
structure, we compare these values to the ones returned 
by a null model of commuting flows. For each city we 
generate random OD matrices of the same size than the 
reference OD matrix but with random flows of individ¬ 
uals that preserve the in- and out- degree of each node 
(see Supplementary Note 5). Figi c) shows the average 
values and standard deviation obtained for 100 replica¬ 
tions. On Fig [^d) we plot the Z-scores of the /, C, 
D, and R values of each city when compared to the val¬ 
ues I*, C*, D* and R* returned by our null model (e.g. 
for the quantity R of the city i, the Z-score is given by 
Z{Ri) = {Ri — R*)/a{R*)). Essentially, we observe that 
the Z-scores of / and R are positive and large, while those 
of C and D are negative (and large in absolute value). 
Also as cities grow, the Z-scores of I and R increase while 
those of C and D decrease. These results demonstrate 
that the larger a city, and the less random it appears. 
This is in contrast with the naive expectation that the 
larger a city and the more disordered is the structure of 
individuals’ mobility. For large cities, there seems to be 
a commuting backbone which cannot result from purely 
random movements of individuals. This backbone is the 
footprint of the city’s structure and history, and probably 
results from strong constraints and efficiency considera¬ 
tions. 


Robustness 

Since our method first requires to determine origin and 
destination hotspots, one could argue that the interpre¬ 
tation of the I,C,D and R values will crucially depend 
on the particular method chosen to define these hotspots. 
The identification of hotspots is a problem that has been 
broadly discussed in urban economics (see Supplemen¬ 
tary Note 3). Roughly speaking, starting from a spatial 
distribution of densities, the goal is to identify the local 
maxima and amounts to choose a threshold p* for the 


density p of individuals: a cell i is a hotspot if the local 
density of people is such that p{i) > p*. In order to test 
the impact of the choice of a particular threshold p* on 
the resulting ICDR values, we measure the sensitivity 
of these values as a function of the density threshold p* 
(see Supplementary Figure 8). As expected, the lower 
the density threshold, the larger the number of origin 
and destination hotspots, and consequently the larger I 
and the smaller R. In contrast, changing the density 
threshold has little impact on the C and D terms. More 
importantly, the conclusions drawn from the compari¬ 
son of the ICDR values across cities remain the same: 
whatever the density threshold p* chosen to define resi¬ 
dential/employment hotspots, we still observe the same 
qualitative behavior: a decay of integrated flows (/) in 
favor of ‘random’ flows (i?), when the population size 
increases. 


Distance of each type of flows 

We now want to characterize spatially these different 
flows and the relation between city size and the commut¬ 
ing distances traveled by individuals. In each city we 
compute the average distance traveled by individuals per 
type of flows I,C,D and R. The resulting average dis¬ 
tances measured in data are plotted in red on Fig. |^a). 
We observe that the average distance for all categories of 
flows increases with population size, an expected effect as 
the city’s area also grows with population size. We also 
observe that the average distance of convergent flows C 
increases faster than for other types of flows (we note 
that the average distance associated to convergent flows 
increases more than the distance associated to divergent 
flows D, showing that these flows are not symmetric as 
one could have naively expected). This result means that 
for this set of Spanish cities, commuters from small resi¬ 
dential areas to important activity centers travel on aver¬ 
age a longer distance than all other individuals. This ob¬ 
servation could be an indication that for our set of cities, 
residential areas have expanded while activity centers re¬ 
mained at their location, leading to longer commuting 
distances (see Supplementary Figures 10 and 11 when 
considering various spatial scales of aggregation). 

Another interesting information is provided by the 
comparison of distances measured in the data with av¬ 
erage distances measured from the random OD matrices 
generated by the null model. The average distances asso¬ 
ciated to the null model are plotted in blue in Fig. |^a). 
We see that for all types of flows the distances measured 
in the empirical data are shorter than those generated 
by the null model. This is another clear indication of 
the spatial organization of individual flows in cities. It 
also highlights the importance of the travel time budget 
in the residential locations choice. Remarkably enough, 
the distance of convergent flows (C) is both the largest 
and the one that increases the fastest with population, 
indicating a low degree of efficiency. 
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FIG. 2: Results for 31 Spanish cities, (a) I (integrated), C (convergent), D (divergent) and R (random) values 
versus population size for 31 Spanish urban areas, (b) Same ICDR values as in (a) but sorted by decreasing order of 
I (note that by definition, we have for each city I + C + D + R=l). It is remarkable that I and R dominate and 
seem almost sufficient to distinguish cities, while C and D are almost constant whatever the city size (see 
Supplementary Figure 7 for the values obtained with another size a of grid cells), (c) /, C, D, R average values and 
standard deviations obtained for 100 replications of a null model, where the inflow and outflow at each node are 
kept constant while flows are randomly distributed at random between nodes, (d) Z-scores obtained by comparing 
the empirical data and the values returned by the null model. Large values of Z-scores show that the actual 
commuting networks cannot be considered as resulting from connecting the nodes at random. The /, C, D and R 

values of a specific city are then a signature of its structure. 


The comparison of this behavior with the null model 
leads to interesting results. On Fig. [^b) we plot the 
ratio Df^uii/Duata for the four types of flow. Values 
lower than 1 indicate that the average commuting dis¬ 
tance generated by the null model is shorter than the 
distance observed in the city. Surprisingly, we observe 
that small cities display a value less than one indicating 
the lesser importance of space at this short scale. We 
also see that this ratio increases faster for random flows 
(i?) than for the others {D, C, I), suggesting a remarkable 


spatial structure of these R flows. 

We also consider the fraction of total commuting dis¬ 
tance by type of flow (Fig.|^. We see that for each type of 
flows, their respective fraction is constant and indepen¬ 
dent of city size. With the LouBar hotspots detection 
method [TB] (see Supplementary Figure 3) and with a 
grid of 1km? square cells, we measure that roughly 40% 
of the total commuting distance is made on random flows 
while the other types represent each about 20%. This re¬ 
sult shows that the method is able to identify where most 
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FIG. 3: Distance per type of flow and population size, (a) Average commuting distance vs. population size, 
per type of flows; (b) ratio D^uii/Djjata per type of flows. The LouBar criteria m is used here to define residential 

and employment hotspots. 


of the commuting distance is traveled. In particular, we 
see that the natural, obvious flows (I) from residential 
centers to activity centers are not the most important 
ones, and that the decentralization of commuting flows 
seems to be the rule for the Spanish cities in our sample. 


Classification of cities 

Finally, the ICDR signature of their OD matrix al¬ 
lows to cluster cities with respect to the structure of 
their commuting patterns. We measure the euclidian 
distance between the cities’ ICDR signatures and we 
then perform a hierarchical cluster analysis. Fig.j^shows 
the dendrogram resulting from the classification. Four 
well-separated clusters are identified on this dendrogram, 
and Table gives the average value of each term along 


with the average population of the cities composing the 
cluster. Remarkably these summary statistics show that 
largest cities are clustered together and are characterized 
by a larger proportion of ‘random’ flows (i?) of individu¬ 
als both living and working in parts of the city that are 
not the dominant residential and activity centers. This 
can be interpreted as an increased facility in bigger ur¬ 
ban areas to commute from any part of the city to any 
other part. Further studies on other cities and countries 
are needed at this stage in order to discuss the relevance 
of the proposed classification. 

It is also important to test the robustness of this clas¬ 
sification and we show that introducing a reasonable 
amount of noise in the OD matrices does not change 
the classification (see Methods and Supplementary Fig¬ 
ure 12). This sensitivity test confirms that the clustering 
is robust against possible errors in the data source and 
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FIG. 4: Total commuting distance. Contributions to the total commuting distance of each type of flows vs. 
population size. We observe that the variations are small and that the largest contribution comes from the R flows. 


in the extraction of the mobility networks. The classifi¬ 
cation of cities based on their ICDR values is also rea- 
sonnably robust to a change of the method used to define 
residential and employment hotspots (see Supplementary 
Figure 13). 


Discussion 

We have proposed a method to extract high-level infor¬ 
mation from large weighted and directed networks, such 
as origin-destination matrices. This method relies on the 
identification of origin and destination hotspots, and this 
first step can be performed with any reasonable method. 
The important second step consists in aggregating flows 
in four different types, depending whether they start and 
end from/to a hotspot or not. 

We have applied this method to commuting networks 
extracted from mobile phone data available in thirty-one 
Spanish cities. The method has allowed us to highlight 
several remarkable patterns in the data: 
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• Independently of the density threshold chosen to 
determine hotspots, the proportion of integrated 
flows (/) decreases with city size, while the propor¬ 
tion of random flows (i?) increases; 

• On average and for all cities considered here, in¬ 
dividuals that live in residential main hubs and 
that work in employment main hubs (/ flows) travel 
shorter distances than the others {C,D,R flows); 


FIG. 5: Classification of cities. Dendrogram 
resulting from the hierarchical clustering on cities based 
on their ICDR values. In front of each city name we 
indicate its rank in the hierarchy of population sizes. 
The largest cities are clustered together. As cities get 
bigger, the ‘random’ component (i?) of their commuting 
flows increases, which signals that it is easier to 
commute from any place to any other in large cities. 


• When the city size increases, the largest impact 
is on convergent flows (C) of individuals living in 
smaller residential areas (typically in the suburbs) 
and commuting to important employment centers; 


• The classification of cities based on the ICDR val¬ 
ues leads to groups with consistent population size, 
highlighting a clear relationship between the popu¬ 
lation size of cities and their commuting structure. 










































Cluster Cities 


P 


I R D C 


Orange Salamanca, Gijon, Cordoba, ... 255,330 0.43 0.27 0.16 0.14 

Dark blue La Coruna, Zaragoza, Santander, Elche, ... 392,970 0.37 0.36 0.15 0.13 

Green Cartagena, Palma, Granada, ... 732,992 0.31 0.41 0.16 0.13 

Light blue Murcia, Barcelona, Bilbao, Madrid, ... 2,463,551 0.25 0.46 0.17 0.12 


TABLE I; Classification of cities. Average ICDR values and average population sizes of the cities composing 
each of the four clusters represented on Fig. As the population grows the proportion of Random flows increases 
while the proportion of Integrated decreases. The weights of Convergent and Divergent flows stay constant among 

the groups. 


In addition, the comparison with a null model led to 
interesting conclusions. Flows in cities display a high 
level of spatial organization and as the population size of 
the city grows, the increase of the Z-scores of I,C,D,R 
shows that the structure of the mobility is more and more 
specific, and far from a random organization. We note 
that an interesting direction for future research would be 
to find some analytical arguments using simple models 
of city organization for estimating these flows, and how 
they vary with population size. 

Our coarse-graining method provides a large scale pic¬ 
ture of individual flows in the city. In this respect it could 
be particularly useful for validating synthetic results of 
urban mobility models (such as [5S] for example), and 
for comparing different models. An accurate modeling of 
mobility is indeed crucial in a large number of applica¬ 
tions, including the important case of epidemic spreading 
which needs to be better understood, in particular at the 
intra-urban level [29l [30] . 

It would also be interesting to apply the proposed 
method to other mobility datasets, with different time 
and spatial scales, in order to test the robustness of these 
commuting patterns. Another direction for future studies 
could be to inspect the time evolution of the Ifi,D and 
R values for datasets describing travel to work journeys 
over several decades (such as national travel surveys). 
The method could also be applied at larger spatial scales, 
for example to capture dominant effects in international 
migration flows. More generally, we believe that an im¬ 
portant feature of the ICDR method is its versatility, as 
it could be applied to any type of data that is naturally 
represented by a weighted and directed network. 

Methods 

Data 

The dataset used for our analysis comprises 55 days of 
aggregated and anonymized records for 31 urban areas of 
more than 200,000 inhabitants. No individual informa¬ 
tion or records were available for this study. The records 
included the set of Base Transceiver Stations (communi¬ 
cation antennas, BTS) used for the communications as 
presented in the Call Detail Records (CDR). A CDR is 
produced for each active phone event, including call/sms 


sending or reception. The number of anonymized users 
represents on average 2% of the total population and at 
most 5% of the total population. These percentages are 
almost the same for all the urban areas. From the CDR 
data obtained for 20 weekdays (from mondays to thrus- 
days only), we extracted home and work places for all 
the anonymized mobile phone users in the dataset. The 
output of this processing phase is an OD commuting ma¬ 
trix for each urban area, at the scale of the BTS point 
pattern. In order to facilitate the calculations and the 
comparison of the results between different cities, the OD 
matrices are then transposed on regular square cells grids 
of varying size a (see Supplementary Note 2). 

Extraction of OD matrices from mobile phone data 

In order to extract OD matrices from phone calls, we 
select a subset of users with a mobility displaying a suf¬ 
ficient level of statistical regularity. For this analysis we 
considered commuting patterns during workdays only. 
The users’ Home and Work locations are identified as 
the Voronoi cells which are the most frequently visited on 
weekdays by each user between 8 pm and 7 am (Home) 
and between 9 am and 5 pm (Work). We assume that 
there must be a daily travel between the home and work 
locations of each individual. Users with a call activity 
larger than 40% of the days under study at home or work 
are considered as valid. We then aggregate the complete 
flow of users and construct the OD matrix with the flows 
between a Voronoi cell classified as home and another cell 
classified as a work place. Since the Voronoi areas do not 
exactly match the grid cells, we use a transition matrix 
to change the spatial scale of the OD matrix, that is to 
transform the Fij values of the OD matrix where i and j 
are Voronoi cells into F^,^, values where i' and j' are the 
cells of a regular grid (see Supplementary Note 2, and 
Supplementary Figure 2 for an example of the partition 
an urban area in BTS Voronoi cells). 

Spatial scale of the OD matrix 

The OD matrix is the standard object in mobility stud¬ 
ies and transport planning [2| and contains information 
about movement of individuals in a given area. More 




9 


precisely, an OD matrix is a n * m matrix where n is the 
number of different ‘Origin’ zones, m is the number of 
‘Destination’ zones and Fij is the number of people com¬ 
muting from place i to place j during a given period of 
time. In transport surveys the size of the OD matrix de¬ 
pends on the spatial scale at which the mobility data has 
been collected. Traditionally the zones that are used to 
partition the city are the administrative units, whose size 
can vary from census and electoral units to whole depart¬ 
ments or states, depending on the purpose for building 
the OD matrix. 

In this study we applied our ICDR method to cities 
divided in square cells that are smaller than administra¬ 
tive units, allowing for a better spatial resolution. In the 
case of OD matrices extracted from CDR mobile phone 
data, the maximal resolution corresponds to the BTS 
(antennas) point pattern. The ICDR method proposed 
in this paper does however not depend on a particular 
spatial scale and can be applied on OD matrices avail¬ 
able at coarser spatial resolutions as well. For a given 
territory the results obtained with the ICDR method - 
the I,C,D and R values in the first place - will obviously 
depend on the spatial resolution and will also depend on 
the method used to define hotspots. It is important to 
note that when the ICDR method is used for compar¬ 
ing cities, the spatial resolution and the hotspots iden¬ 
tification method should be the same for all cities (see 
Supplementary Notes 4 and 6, and Supplementary Fig¬ 
ures 8 and 9 for results obtained with another hotspots 
delimitation method, and Supplementary Figure 7 for re¬ 
sults obtained when considering another spatial scale of 
aggregation). 

Robustness of the classification of cities 

In order to ensure that the classification of cities based 
on their ICDR matrices is robust, we introduce a noise 
in the flows Fij (for all the thirty-one cities). We focus 
on the case where the workplace of an individual can be 
modified and where the number of individuals living in 
each cell i is kept constant. More precisely, the noise is 
introduced as follows: 

1. We pick up a uniform random positive integer g, the 
number of individuals whose workplace is reshuf¬ 
fled. This number g varies from I to iV = ^ Fij 
the total number of commuters in the city. 

2. We repeat g times the following operation: we pick 
up randomly a residence and a workplace (a cou¬ 
ple of values {i,j)) s-nd move one individual from 
her workplace j to put another randomly chosen 
workplace j': Fij —)• Fij — 1 and Fij^ —>• Fiji -|- 1 

The parameter of this workplace reshuffling is then / = 
g/N. In order to evaluate how much the classification of 
cities is affected by this noise, we compute the Jaccard 
index J/ between the reference classification of cities in 


k groups and the classification in k groups obtained for 
the noisy OD matrices. The Jaccard index measures the 
similarity of two partitions P and P' of same size: 


where 

• a : number of city pairs that are in the same group 
for both P and P' 

• b : number of city pairs that are in different groups 
in P but in the same one in P' 

• c : number of city pairs that are in the same group 
in P but not in P' (or conversely) 

The Jaccard index Jj is in the range [0; 1] and the closest 
it is to 1, the larger the similarity between P and P'. 

We generate 100 noisy matrices for each value of / and 
compute the average value J/ of the Jaccard index. This 
average value encodes the distance between the reference 
partition P of cities in k groups and the partitions of 
cities in k groups obtained for the noisy OD matrices. 
Supplementary Figure 12 shows the values of Jj versus 
the proportion / of reshuffled individuals, for different 
number of groups k. The red shaded rectangle on each 
panel corresponds to the mean value +/- the standard 
deviation obtained for 1, 000 replications of a null model, 
in which permutations of cities among the k groups are 
randomly performed. We observe here that up to 20% 
of reshuffled individuals, the average value J/ obtained is 
always significantly larger than the one obtained for the 
null model, indicating that the classification is robust 
even for important values of the noise. 
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Supplementary Information 
Supplementary Figures 



Figure SI. The 31 biggest Spanish urban areas under 
study. 
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Figure S3. Illustration of the ‘LouBar’ hotspots 
determination method. We show the Lorenz curve and 
how we determine the thresholds for selecting hotspots (see 
details in Supplementary Note[|. 



Figure S2. Map of the metropolitan area of 

Barcelona. The white area represents the metropolitan 
area, the dark grey area represents territory surrounding the 
metropolitan area and the gray area the sea. (a) Voronoi 
cells, (b) Intersection between the Voronoi cells and the 
metropolitan area. 
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Figure S4. Scaling of hotspots, (a) Number of residential 
hotspots vs. population size of the city ; (b) Number of work 
hotspots vs. population size of the city ; The scaling relation 
is sublinear, indicating an ’economy of scale’ in the spatial 
organisation of cities. The exponent is remarkably smaller in 
the case of the work/main daily activity hotspots, which 
means that in Spanish cities the number of important places 
that concentrate many jobs and daily amenities, grows 
slower than the number of important residential places. 
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Figure S5. Effect of grid size on scalings. Numbers of 
residential hotspots and work hotspots (defined with the 
LouBar method, see section [J vs. the population size of the 
city for different sizes a of grid cells. In any case the scaling 
relations for both quantities stay sublinear, indicating an 
’’economy of scale” in the spatial organisation of cities. Also 
remarkable is that the scaling exponent of the number of 
residential centers is always larger than the exponent of 
work centers. 



(a) (b) 

Madrid Barcelona 



Figure S6. Location maps of residential and work 
hotspots in four cities. Each square cell represents a zone 
of 4 km^. Blue cells are residential hotspots only, red cells 
are work hotspots only, and green cells are simultaneously 
residential and work hotspots. These maps exhibit different 
spatial organisations of residential and work hotspots. 


Figure S7. Effect of grid size on ICDR values. / 

(integrated),C (convergent),!) (divergent) and R (random) 
values of the 31 Spanish urban areas ranked by decreasing 
values of !, for a = 2km. While more dispersed, the data 
exhibit the same pattern than the one obtained with 
a — 1km and represented on Figure 2 in the main text. 
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Figure S8. Sensitivity of ICDR values with the 
density threshold. I (integrated),C (convergent),!) 
(divergent) and R (random) values of the 31 Spanish cities 
ranked by decreasing as a function of the density threshold 
chosen to determine origin and destination hotspots of the 
network. 



Figure S9. ICDR values obtained with the ’Average’ 
criteria. I (integrated),!) (convergent),!) (divergent) and R 
(random) values of the 31 Spanish cities ranked by 
decreasing order of 7, for a — 1km and when dehning 
hotspots with respect to the ’Average’ criteria (i.e. all cells 
whose density of residents/workers is greater than the mean 
value are considered as residential/employment hotspots. 
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Figure SIO. Commuting distance per flow type 
(a = Ikm). Contribution to the total commuting distance 
of each type of flow vs. population size, for two different 
hotspots definition methods, and for a = 1km. These 
contributions do not depend on population size. 
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Figure Sll. Commuting distance per flow type 

(a = 2km). Contribution to the total commuting distance 
of each type of flow vs. population size, for two different 
hotspots definition methods, and for a = 2km. These 
contributions do not depend on population. 



Figure S13. Effect of hotspots definition on clusters. 

Jaccard index J/ resuming the sensitivity of the clusters to 
the threshold chosen to determine hotspots. The light red 
shaded rectangle on each panel corresponds to the mean 
value +/- the standard deviation obtained for 1000 
replications of a null model, in which permutations of cities 
among the k groups are performed randomly. 



T-1-1-1-TT-1-1-1-TT-1-1-1-TT-1-1-1-TT-1-1-1-T 

i=n/N: proportion of commuters whose workplace is changed 

Figure S12. Effect of noise on clusters. Jaccard index Jj 
resuming the sensitivity of the clusters to some uniform 
noise introduced in the commuting flows for different 
numbers of groups k. The light red shaded rectangle on each 
panel corresponds to the mean value +/- the standard 
deviation obtained for 1000 replications of a null model, in 
which permutations of cities among the k groups are 
performed randomly. 
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Supplementary Notes 

Supplementary note 1 

Comparing properties of cities of very different popu¬ 
lation sizes and areas requires to rely on a harmonized, 
spatial delimitation of cities that goes beyond the ad¬ 
ministrative boundaries defined by national authorities. 
These administrative boundaries may be somewhat arbi¬ 
trary and do not follow neither the morphological nor 
functionnal spatial footprint of the city. To this end 
we have chosen to rely on the urban areas defined by 
the AUDES initiative (Areas Urbanas De ESpana: AU- 
DES project. Documentation and open data available at 
http;//alarcos.esi.uclm.es/per/fruiz/audes/ (ac¬ 
cessed January 27, 2014)). This project is an attempt to 
capture some coherent spatial delimitations of Spanish 
cities regarding the (journey to work) commuting pat¬ 
terns of individuals living in the core city of each urban 
area and in their surrounding municipalities. These de¬ 
limitations are built upon statistical criteria based on 
the proportion of residents of surrounding municipalities 
that commute to the main city to work. The locations 
and spatial boundaries of the 31 Spanish urban areas 
analysed in this study are represented on Supplementary 
Eigure SI. 


Supplementary note S 


We note in Equation that we remove the intersec¬ 
tion of the Voronoi area with the sea, indeed, we assume 
that the number of users calling from the sea are neg- 
ligeable. Now we consider the number of mobile phone 
users Ny and the associated area Ay of the Voronoi cells 
intersecting the UA (see Supplementary Figure S2b). 

Extraction of Origin-Destination matrices We cannot 
directly extract an OD matrix between the grid cells with 
the mobile phone data because each users’ home and 
work locations are identified by the Voronoi cells. Thus, 
we need a transition matrix P to transform the BTS OD 
matrix B into a grid OD matrix G. 

Let m be the number of Voronoi cells covering the ur¬ 
ban area and n be the number of grid cells. Let B be 
the OD matrix between BTSs where Bij is the number of 
commuters between the BTS i and the BTS j. To trans¬ 
form the matrix B into an OD matrix between grid cells 
G we define the transition matrix P where Pij is the area 


of the intersection between the grid cell i and the BTS j. 
Then we normalize P by column in order to consider a 
proportion of the BTSs areas instead of an absolut value, 
thus we obtain a new matrix P (Equation ^ . 


p.. =__ fqj 

Z^fe=l ^kj 

The OD matrix between the grid cells G is obtained by 
a matrices multiplication given in the following equation: 


Voronoi cells We remove the BTSs with zero mobile 
phone users and we compute the Voronoi cells associated 
with each BTSs of the urban area (hereafter called UA). 
We remark in Supplementary Figure S2a that there are 
four types of Voronoi cells: 

1. The Voronoi cells contained in the UA 

2. The Voronoi cells partitionned between the UA and 
the territory outside the UA. 

3. The Voronoi cells between the UA and the sea 
(noted S). 

4. The Voronoi cells between the UA, the territory 
outside the UA and the sea. 

To compute the number of users associated with the 
intersections between the Voronoi cells and the UA we 
have to take into account these different types of Voronoi 
cells. Let m be the number of Voronoi cells, Ny the num¬ 
ber of mobile phone users in the Voronoi cell v and Ay 
the area of the Voronoi cell v, v S |[l,m]|. The number 
of users Nyf-^uA in the intersection between v and UA is 
given by the following equation: 


NynuA = Ny 



vnuA 


( 3 ) 


G = PBP* 


( 5 ) 


Supplementary note 3 

A city is divided in n cells and the data give us ac¬ 
cess to its n X n OD matrix extracted from the mobile 
phone data. After straightforward calculation we obtain 
the distributions {Gr) and {Gw) of the numbers of resi¬ 
dents/workers in the n cells composing the city. The de¬ 
termination of centres and subcentres is a problem which 
has been broadly tackled in urban economics (see for ex¬ 
ample the references given in the paper and references 
in |16]L Starting from a spatial distribution of densities, 
the goal is to identify the local maxima. This is in princi¬ 
ple a simple problem solved by the choice of a threshold 
5 for the density p: a cell i is a hotspot if the density 
of users p{i) > S. It is however clear that such method 
based on a fixed threshold introduce some arbitrariness 
due to the choice of <5, and also requires prior knowledge 
of the city to which it is applied to choose a relevant 
value of 6. In [16] we proposed a generic method to de¬ 
termine hotspots from the Lorenz curve of the densities. 
In the following we quickly introduce the principle of the 
method and its application to the determination of the 
residential and work hotspots of each city. We invite the 
interested readers to refer to |16j for further discussion 
on this method. 


Ay — A„ns 
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We first sort {Cr) and {Cw) in increasing order, and 
denote the ranked values by C]^{resp. Cjy) < C]^ < ... < 
where n is the number of cells. The two Lorenz curves 
of the distribution of residents/workers are constructed 
by plotting on the x-axis the proportion of cells F = 
ijn and on the y-axis the corresponding proportion of 
commuters L with: 


m = 


EUcj, 


( 6 ) 


broader acceptation of what is an hotspot, it is obvious 
that the term I will increase drastically since we increase 
both the number of residential hotspots and the number 
of work hotspots. Still what is important to notice is that 
we can still observe the qualitative trend observed with 
the LouBar method. As the population size increases, 
the decrease of the proportion I of integrated flows is ac¬ 
companied by an increase of the proportion R of random 
flows. 

Supplementary note 5 


If all the cells had the same number of residents/workers 
the Lorenz curves would be the diagonal from (0, 0) to 
(1,1). In general we observe a concave curve with a 
more or less strong curvature. In the Lorenz curve, the 
stronger the curvature the stronger the inequality and, 
intuitively, the smaller the number of hotspots. This re¬ 
mark allows us to construct a criterion by relating the 
number of dominant places (i.e. those that have a very 
high number of residents/workers compared to the other 
cells) to the slope of the Lorenz curve at point F = 1: 
the larger the slope, the smaller the number of dominant 
values in the statistical distribution. The natural way 
to identify the typical scale of the number of hotspots 
is to take the intersection point F* between the tangent 
of L{F) at point F = 1 and the horizontal axis L = 0 
(see Suplementary Figure S3). This method is inspired 
from the classical scale determination for an exponential 
decay: if the decay from F = 1 were an exponential of 
the form exp—(1 — F)/a where a is the typical scale we 
want to extract, this method would give 1 — F* = a (see 
[l6] for further discussion). 


In order to evaluate to what extent the ICDR values 
of a given city are characteristic of its commuting struc¬ 
ture, we compare these values to the values returned by a 
reasonable null model of commuting flows. The guiding 
idea is to generate OD matrices that (i) have the same 
size than the city’s OD matrix (i.e that contain the same 
total number of individuals) ; (ii) that respect the city’s 
static spatial organisation (i.e. the in- and out-degrees of 
all nodes should stay constant) ; and (hi) that randomize 
the flows between the nodes, i.e. with different weights 
of the edges. Such a null model of flows that respects the 
static organization of the city is indeed more reasonnable 
and realistic than a null model that would respect the to¬ 
tal number of individuals in the matrix but that would 
modify the in- and out-degrees of the nodes. 

To generate a random graph that conserves the in- and 
out- degree of each node of the reference graph, we use 
the Molloy-Reed algorithm m which complexity is in 
0(n), where n is the sum of the weights of the edges (i.e. 
the number of individuals in the OD case). 


Supplementary note 4 

On Supplementary Figure S8 we plot the I,C,D,R val¬ 
ues of the 31 Spanish urban areas as a function of the 
density threshold p* chosen to define hotspots (here de¬ 
fined relatively to the density value plb returned by the 
’LouBar’ method - see section]])). 

In the extreme case represented on Supplementary 
Figure S9 all cells whose number of residents/workers 
are greater than the mean value of the distribution of 
residents/workers are tagged as hotspots (see [TB] for 
a discussion of this criteria and his comparison to the 
’’LouBar” criteria used in this study). With a much 


Supplementary note 6 

In order to evaluate the sensitivity of the classification 
of cities to the number of hotspots selected in cities, for 
each city i we make vary the number of work hotspots 
between FL^j the reference value returned by the LouBar 
method (see Supplementary Note 3), and two times the 
reference value 2 x Fl^^: -I- 5),<5 G [0; 1]. As for 

the sensitivity to noise in the flows Cij, we evaluate the 
stability of the classification of cities in k groups with the 
Jaccard index (see Supplementary Note]]. The values of 
Ji as a function of S are represented on Supplementary 
Figure S13. 



